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COMPLEXITY OF KRONECKER OPERATIONS ON SPARSE MATRICES WITH 
APPLICATIONS TO THE SOLUTION OF MARKOV MODELS 

PETER BUCHHOLZ*, GIANFRANCO CIARDOt, SUSANNA DONATELLI*, AND PETER KEMPER 5 

Abstract. We present a systematic discussion of algorithms to multiply a vector by a matrix expressed as 
the Kronecker product of sparse matrices, extending previous work in a unified notational framework. Then, 
we use our results to define new algorithms for the solution of large structured Markov models. In addition 
to a comprehensive overview of existing approaches, we give new results with respect to: (1) managing 
certain types of state-dependent behavior without incurring extra cost; (2) supporting both Jacobi-style 
and Gauss-Seidel-style methods by appropriate multiplication algorithms; (3) speeding up algorithms that 
consider probability vectors of size equal to the “actual” state space instead of the “potential” state space. 

Key words. Kronecker algebra, Markov chains, vector- matrix multiplication 

Subject classification. Computer Science 

1. Introduction. Continuous time Markov chains (CTMCs) are an established technique to analyze 
the performance, reliability, or performability of dynamic systems from a wide range of application areas. 
CTMCs are usually specified in a high-level modeling formalism, then a software tool is employed to generate 
the state space and generator matrix of the underlying CTMC and compute the steady-state probability- 
vector, from which most quantities of interest can be obtained as a weighted sum by using “reward rates” 
as weights [17]. 

Although the mapping of a high-level model onto the CTMC and the computation of the steady-state 
distribution are conceptually simple, practical problems arise due to the enormous size of CTMCs modeling 
realistic systems. Sophisticated generation and analysis algorithms are required in practice. 

In this paper, we consider the steady state solution of large ergodic CTMCs, that is, the computation 
of the vector 7r, where 7 r* is the steady- state probability of state i. However, our contributions can also be 
used to improve the generation of the state space [8, 21] and other types of analysis such as the computation 
of the expected time spent in transient states up to absorption in absorbing CTMCs and transient analysis 
of arbitrary CTMCs [6]. 

7T € IR' T I is the solution of the system of linear equations 
(1.1) 7T • Q = 0 subject to 7r • 1 T = 1, 

where Q is the generator matrix and T is the set of states of the CTMC. 

Direct solution methods such as the well-known Gaussian elimination are not applicable, since their 
fill-in results in excessive memory requirements. Iterative techniques based on sparse storage schemes for Q 
are more appropriate, but even they are memory-bound when applied to realistic examples. Virtual memory 
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Symbol 

Definition or properties 

Meaning 

M fc 


k - th submodel 

n k 

Tik > 2 

Number of local states for M* 

nr 

til,** 

Number of potential states for M[i jU j 

n 

ni 

Number of overall potential states 

Uk 

Tifrik = * n^^j 

Number of potential states when M k is ignored 

f k , T k 

{0, . . . ,n* - 1} 

Potential, actual local state space (s. s.) for M k 

ti k 

T 1 x T 2 x • • • x T* 

Potential s. s. for 

t,T 

t = f 1 K ,Tct 

Potential and actual overall s. s. 

Tf 

{*[!,*:] : , ![!,/<•] E T} 

Projection of the actual s. s. T on M[i >k ] 


{ifc ■ i[i,k] € Tx) 

Actual s. s. for M k when Mi is in state ij, VI < l < k 


* : f -> {0, . . . , \T\ - 1, null} 

Position of a state in lexicographic order 

7T, 7T 

Vt 6 T TTi = 7T 'J'(t) 

Steady state probability vector 

Q,Q,R,R 

Q = Qr,r,R — R T,T 

Infinitesimal generator, transition rate matrix 

h,h 

Vi E T h* = b*(i) 

Expected holding time vector 


Table 2.1 

Symbols used in the paper 


is of little help, since access times to virtual memory are too long to allow an efficient implementation of 
iterative solution techniques (although [11] reports some encouraging results). 

Recently, solution techniques for CTMCs have been developed that compute n without generating and 
storing Q explicitly. The idea is to represent Q as a sum of Kronecker products of smaller matrices that 
result from a high-level model structured into submodels. A general framework for this idea is described in 
Sect. 4. This covers several high-level formalisms to describe models in a compositional way [13, 14, 18, 24]. 

Solution methods exploiting a Kronecker structure are iterative and they differ from conventional iter- 
ation techniques in how they perform the required vector-matrix multiplications. Earlier approaches [24] 
employed the slowly- converging Power method, used dense storage schemes for the submodel matrices, and 
computed the solution using the “potential” state space, a (possibly much larger) superset of the actually 
reachable states. This resulted in a limited applicability, since these approaches become inefficient for models 
where submodel matrices are quite sparse, the actual state space is much smaller than the potential state 
space, or the Power method requires too many iterations. 

More recent publications [18, 9] began to overcome these limitations, but there is still no systematic 
approach to fully exploit the potential of Kronecker-based solution techniques. In this paper, wc present a 
family of solution techniques using sparse storage for the submodel matrices and iteration vectors of the size 
of the actual state space, and we consider both the Jacobi and the Gauss-Seidel methods. Additionally, we 
compare the complexity of the Kronecker-based vector-matrix multiplication algorithms we present, both 
theoretically and by means of a realistic example. 

In the next section, we define the notation used. Sect. 3 contains a framework for the description of 
Markov models and Sect. 4 considers composed Markov models and the corresponding Kronecker structure 
of the generator matrix. Sect. 5 presents and analyzes different algorithms to multiply a vector by a matrix 
represented as a Kronecker product, using a running example. Sect. 6 describes iterative solution approaches 
that use these multiplication algorithms to compute the steady-state solution of a CTMC. 
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2. Notation. Table 2.1 summarizes the symbols used in this paper. Except for the set of real numbers, 

IR, all sets are denoted by upper-case calligraphic letters (e.g., A ); row vectors and matrices are denoted by 
lower- and upper-case bold letters, respectively (e.g., x, A); their entries are indexed starting from 0 and 
are denoted by subscripts (e.g., x*, A ij); a set of indices can be used instead of a single index, for example, 
A x,y denotes the submatrix of A corresponding to set of rows X and the set of columns y. We also denote 
families of like-quantities with subscripts (for scalars) or superscripts (for sets, vectors and matrices) (e.g., 
X{ or x l ) and use a shorthand “range” notation to indicate sequences of them (e.g., = xi, . . . , x n ) 

77 [A] denotes the number of nonzeros in matrix A. 0 xxy and 1 xXy denote matrices with x rows and 
y columns, having all entries equal 0 or 1, respectively, while I x denotes the identity matrix of size x x x; 
the dimensions of these matrices are omitted if clear from the context. Given a vector x, diag(x) is a 
square matrix having vector x on the diagonal and zero elsewhere. Given annxn matrix A, rowsum(A) = 
diag ( A - l nxl ) is a matrix having the diagonal equal to the sums of the entries on each row of A, and zero 
elsewhere. 

We recall the definition of the Kronecker product A = ®f =1 A fc of K square matrices A k E 2R nfcXn *. 
Let n" = nLi n *> n = > and n* = njn^. If we assume a mixed-base numbering scheme, the tuple 1[i,k] 

corresponds to the number (. . . ((/i)ri 2 + / 2)^3 * • ')tik~\-Ik = SfcLi ^ n fc+i (letting = 1), and vice versa. 

If we assume that i[i t K] and j\i,K] are the mixed-based representation of i and j, respectively, the generic 
element of A € M nxn is 

A*>j “ = ' ‘ ’ 

The Kronecker sum 0^ A k is defined in terms of Kronecker products, as 
K K K 

^^A = Im ® ’ ' * ® -l-nfc-i ® A ® In fc + i ® ® ® A* ® ’ 

k= 1 k= 1 k=l 

We are interested in algorithms that exploit sparsity. For the Kronecker product, the number of nonzeros 
is *7[&fcLi A k ] = Uk=x *7[A*]. For the Kronecker sum, diagonal entries in the matrices A fc might result in 
merged entries on the diagonal of A, thus we can only bound the number of nonzeros, ^[©^Lj A k ] < 
SfeLi( 7 ?[A fc ] • frfc)* This bound is achieved iff at most one matrix A k contains nonzero diagonal entries. On 
the other hand, if all diagonal elements of the matrices A k are positive (or all are negative), r?[0£ Li A*] = 
£f=ifa[A fc ] • rifc) — (K — 1) * n. As a consequence, the Kronecker sum of K > 2 matrices (with > 1) can 
never be a full matrix. 

3. Description and solution of a Markov model. A formal high-level model having an underlying 
CTMC specifies a stochastic automaton of some sort. Instead of assuming a specific high-level formalism 
such as Stochastic Petri Nets [5], Queueing Networks [26], or Stochastic Process Algebra [16], we consider a 
generic framework where a model is a tuple: 

M = (T, E , i° ) active, new , ra£e, weight) 

• T is the set of potential states. 

• E is the set of possible events. 

• i° E T is the initial state. 

• active : E x T — + {True, False}. 

• new : E x T — ► T U {null}; new(e,i) = null iff active(e, i) = False. 

• rate : E — ► JR + . 
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• weight :£xf-> JR + . 

If an event e is active in state i (that is, if active(e,i) = True ) and if new(e,i) — j, we say that state 
j is reachable in one step from i. The transitive and reflexive closure of this relation is called reachability. 
Using the reachability relation, we can build the set T C T of states reachable from the initial state i°, 
called the (actual) state space of M . An event e active in state i has an exponentially-distributed duration 
with parameter rate(e) • weight(e, i). Since the duration of the events is exponentially distributed, and the 
next state depends only on the current one (by definition of function new), M defines a CTMC. 

T can be generated using a state-space exploration algorithm, essentially a breadth-first search of the 
graph implicitly defined by the model, starting from i°; assuming that T is finite, the search terminates. 
Then, we can define a function ^ : T — ♦ {0, . . . , |T| — 1, null}, with ^(i) = null iff i £ T and such that the 
restriction of $ to T is a bijection. 

With the indirect iterative solution methods for the steady-state solution of the CTMC, it is convenient 
to write the infinitesimal generator Q E ]Rl T I x l T 1 as 

Q = R — rowsum( R) = R — dta^(h) -1 , 


where R E is the transition rate matrix and h is the vector of expected holding times. R differs 

from Q only in its diagonal, which is zero, while h contains the inverse of the negative of the diagonal of Q. 
R can be generated at the same time as T, or in a second pass, once \T\ is known, since we can then use an 
efficient sparse row- wise or column- wise format [23]; h is instead stored as a full vector. 

In the following, we assume that R is irreducible, that is, the CTMC is ergodic. 


4. Model composition and Kronecker description of Q. We consider structured models de- 
scribed as the parallel composition of a set of stochastic automaton submodels. Formally, a stochastic 
automaton M = (T, £, i°, active, new, rate, weight) is the parallel composition of the K models Af* = 
(T k ,£ k , active k , new k , rate k , weight k ), 1 < k < K, iff: 

• T = T 1 x . . . x T K , A state of the composed model is the tuple describing the local state of the K 
submodels: i = 

• £ = (Jf =1 Let £s = {e : 3h ^ k : e E £ h Pi € k } be the set of “synchronizing” events common to 
two or more submodels. The remaining events can be partitioned into K sets of “local events” , one 
for each submodel: £ k = £ k \ £s- 

A submodel Af* has no influence on an event e £ £ k , nor is it influenced by such events. Hence, we 
extend the local functions active k , weight k , new k , and rate k to the entire set of events £ as follows. 
For any e £ £ k and for any local state i * E T k : 

— active k (e,ik) = True (neutral element for logical conjunction). 

— weight k (e,ik) = 1 (neutral element for multiplication). 

— new k (e,ik) = ik (neutral operation for state changes). 

— rate k (e) = oo (neutral element for the minimum). 

• - 2 ° 

• 1 ~ l [i t K]- 

K 

• active{e,i] [ i K]) — f\ active k (e,ik)- An event local to M* is active iff it is active in Affc; a synchro- 


k = l 


nizing event e is active iff it is active in all submodels where it is defined, 
null if 3k, active k (e,ik) = False 
j[i,K] otherwise 


new(e,i [hK ]) = 


, where jk = new k (e,ik ). 
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K 

• weight(e } i[ liK ]) = weight* (e>i k ). 

k=i 

• rate(e) = ^ rain^{rate fc (e)}. The choice of using the minimum is just one possibility; for the solution 
algorithms we present, any function will do. For example, we could instead assign the rate on the 
global model, independently of the values rate k (e). 

In the presence of multiple synchronizations, T can be a strict superset of T, and, in this case, some 
local states in T k might be unreachable. Thus, we define the “actual local” state spaces as the projection of 
T on the fc-th component: 


T k = {i k : 3 j[ itK ] eT,j k = i k } C T k , 

We can then redefine T k as T fc , and assume from now on that the two are identical. This improves both 
memory requirements and execution time, at the cost of exploring T once. 

The compositional definition of M allows a structured description of the transition rate matrix based 
on the following “local” matrices: 

• W k (e) is a n k x n k matrix defined as 

^ I weight* (e ,1k) if active k (e,i k ) = True and j k = new k (e,i k ) 

W «- ( ' )= jo otherwise 

• R fc are the local transition rate matrices describing the effect of local events: 


R fc = rate(e) • W fe (e). 
eesl 

Using related frameworks, [3, 9, 14, 24, 25] have shown that both the transition rate matrix R and 
the infinitesimal generator Q underlying M can be expressed as the restrictions to the reachable states of 
appropriate matrices, R = Rr,r and Q = Qr,r, defined as Kronecker expressions on the W fc (e) and R fc 
matrices. The expression for R is: 

K 

(4.1) R = £ rate(e) * 0W fe (e) + 

fc=l 

. . -> 

synchronizing events local events 

The expression for Q is analogous but we omit it because, in the following, we assume that only R is kept 
in Kronecker form, and that h, or its “potential” version h, is stored explicitly as a full vector. Alternatively, 
we could save memory by using a Kronecker description for h, at the cost of additional execution time. 

4.1. A small example. To illustrate the use of Kronecker operators for the description of R, we 
consider first a simple example with K = 2 models Mi and M 2 . A more complex running example is 
introduced in the next section. 

The definition of Mi and M 2 is given in Table 4.1, the extension of functions to the entire set £ is 
omitted for readability. The composed model M has a state space T = {(0,0), (0, 1), (1,0), (1, 1)} and its 
events are £ = £s > where £\ = {ei} is the only local event in Mi, £\ = {e 2 } is the only local event 

in M 2 , and £s = {e3} is the only synchronizing event. If we define rate(ei) = 6, rafe(e 2 ) = 5, rate(e 3) = 4, 
we obtain the following matrices (zero entries are omitted for readability): 
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model M\: 

^ = {0,1} 

€ 1 = {ei, ^3} 

active 1 (•, ■) — False except 

active 1 (ei, 0 ) ~ True new 1 (e 1 , 0 ) = 1 

active 1 (e 3, 1 ) = True new *(e3, 1 ) = 0 

weight 1 (ei, 0) = 0.6 
weight 1 (e 3, 1 ) = 0.4 

model M2: 

r 2 = { 0,1} 

£ 2 = {e 2 ,e 3 } 

active 2 (•, •) = False except 

active 2 (e 2, 1 ) = True new 2 (ei , 1 ) = 0 

acrit>e 2 (e3, 0 ) = True new 2 (e3, 0 ) = 1 

weight 2 (e2j 1 ) = 0.7 
weight 2 (e 3,0) = 0.5 

Table 4.1 

Definition of the submodels for our small example 

0 1 

0 1 

0 1 


0 


9 0 

0.5 

1 0 

6 0.6 

~ 0 




W 2 e 3 )= 


R = 


r 2 = 


1 

0.4 

V ' 1 


1 


1 

5 - 0.7 


resulting in 

2 2 

R = 4 - 0 W fc (e 3 ) + 0 R* = 4 - (W 1 (e 3 )®W 2 (e s )) + (R 1 0 I 2 + h 0 R 2 ) - 

fc=l 


( 0 , 0 ) 

(0,1) 

( 1 , 0 ) 

( 1 , 1 ) 



4 .2. A running example. We now describe the running example used throughout the rest of the 
paper to obtain timing results. It models a flexible manufacturing system (FMS) with three machine centers 
(ci, C2, and C3) and four types of parts being processed. Fig. 4.1 depicts it as a fork-join queuing network, 
with machines as queues and parts as customer classes (chains). We assume exponentially distributed service 
times for simplicity. C3 can process up to three parts in parallel, C2 up to two, and c\ only one. 

A part of type A accesses c 3 with high priority and, after service completion, it is either rescheduled for 
processing at C 3 or joined with a part of type B for processing at C 2 . A part of type B accesses C\ with high 
priority and, after service completion, it is either rescheduled for processing at c\ or joined with a part of 
type A for processing at C2. The joint processing of parts of type A and B occurs with high priority on C2 
and, after service completion, it yields a product which is delivered and replaced by its original raw parts, 
to keep a constant stock of material in the system. 

The FMS also produces a second product with low priority, to reduce idle time on machines. The 
low-priority product is processed in the same manner as the high-priority product, but from parts of type 
C (instead of A) and D (instead of B). The only difference is that processing of the corresponding parts 
can only take place on a machine that has no high-priority work to be performed (we assume a preemptive 
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• 

part A 

♦ 

part B 

0 

part C 

0 

part D 

■ 

part A and B 

□ 

part C and D 


Fig. 4.1. Multiclass Queueing Network for our running example 


Type of matrix 

e 

k = H 

k = L 

W fc (e) 

low 1 

1,158 

584 

W fc (e) 

low 2 

2,259 

135 

W k (e) 

lows 

2,214 

584 

R k 

local 

11,844 

1,438 


Table 4.2 


Number of nonzeros using the first decomposition. 


priority policy). The parameters ua , tiq, nc, and no give the number of parts of each type present in the 
system. 

We start with decomposing the model into two submodels according to the priority of parts. Submodel 
H describes the processing that machines Ci, C 2 , and C3 perform on the high-priority parts A and B, including 
their joint processing; submodel L is analogous, for the low-priority parts C and D. 

The synchronizing events are 8s = {I 0 W 1 J 0 W 2 , lows], representing the low-priority parts using the three 
machines. For ua = n# = 4, and nc = no = 3 we obtain \T H \ — 2,394, \T L \ = 652, and \T\ = \T\ = 
1,560,888. Table 4.2 gives the number of nonzeros for the matrices involved in the Kronecker description of 
R 

If matrix entries are stored in double precision, the Kronecker description of R requires 388,800 bytes. 
The explicit sparse-storage representation for R would instead require about 126 MB in single precision or 
180 MB in double precision. Obviously, the Kronecker representation of R is extremely space-efficient in 
this case. 

5. Complexity of vector-matrix multiplication. If A is a n x n matrix stored explicitly using sparse 
storage, the complexity of computing the product x • A is 0(ry[A]). Storing A in a full two-dimensional 
data structure is inefficient for the type of problems we consider; in any case, it is equivalent to assuming 
that 77(A) = n 2 from a complexity point of view, so we restrict ourselves to sparse storage from now on. If 
A is instead stored implicitly as the Kronecker product of K matrices A* € IR nfcXnfc , k € {1, . . . ,K}, also 
stored in sparse row- wise or column- wise format, as appropriate, a direct application of Eq. 2.1 requires K 
operations to obtain each matrix entry. If each A ij is computed only once for each pair (i, j) and only 
nonzero A ij elements are computed, the complexity of computing x • A becomes O (K * r?[A]). In the next 
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njijsq, A^ 1,K h inout: x, y); 

1. 

nuft 4— 1; 

2. 

bright * Tlj r 

3. 

for k — 1 to K 

4. 

base 4— 0; 

5. 

jllTfip 4 Tlfc • n r ight 1 

6. 

if A k / I then 

7. 

for block — 0 to m e /t — 1 

8. 

for offset — 0 to n r i 9 ht — 1 

9. 

index 4 — base + offset ; 

10. 

for h = 0 to n* — 1 

11. 


12. 

index index + Wright 

13. 

z' 4— z • A k ; 

14. 

index 4— base + of fset\ 

15. 

for h — 0 to rijt — 1 

16. 

y»ndex 4 %h I 

17. 

tndex 4 zndex 1 TZrfi^Tit 

*-* 

00 

base 4— base + jump; 

19. 

x <- y; 

20. 

nieft * Tlleft * Tifc , 

21. 

bright * n r i g ht / 1 


Shfft* (\rv. x, nuf t , rik, n r i g ht, A fc ; inout: y); 

1. base <— 0 ; 

2. JUTTip 4 Ti|c ■ 

3. for block = 0 to nu/t — 1 

4. for offset — 0 to — 1 

5. index 4— 6ase + offset', 

6 . for h = 0 to 7ik — 1 

7. Zfc ^ Xindezi 

8. index <— index -j- n r i g ht] 

9. z' <— z*A fe ; 

10. index <— 6ase 4 - offset ; 

11. for /t = 0 to rife — 1 

12. yindex * yindex H“ 

13. index <— index + n rig ht ; 

14. 6ose base -f jump; 


Fig. 5.1. Vector-matrix multiplication using perfect shuffles. 


section, we recall an algorithm that achieves a better complexity by exploiting the inherent structure of a 
Kronecker product. 

Before doing so, however, we observe that the matrices involved in the Kronecker products of Eq. 4.1 
can have very few elements. In our particular application, the matrices A k are either the K matrices W fc (e) 
for a given e E £ , or they are all the identity except for one of them being equal R fc , for a given k. This last 
case arises from rewriting the Kronecker sum appearing in Eq. (4.1) into a (regular) sum of of Kronecker 
products, as explained in Sect. 2. Hence, we consider three levels of sparsity, according to the average number 
a = r)[A k ]/nk of nonzeros per row or column in the matrices A k (in the following we assume the same a for 
all matrices A fe ), where A = 

hypersparse: a <§: 1 77(A) <C n (only a few nonzeros, most rows and columns are empty), 

ultrasparse: a « 1 => 77(A) « n (each row or column has one nonzero, on average), 
sparse: a 1 77(A) = n • a K n (any other sparse matrix). 

We focus on the case of sparse or ultrasparse, that is, we assume that 77(A fc ] > for all k = 1, — , K, 
Truly hypersparse matrices can occur in our Kronecker approach, clearly extreme cases might be best man- 
aged by explicitly storing a list of triplets (z, j, A^j), one for each nonzero in A. 

5.1. The shuffle algorithm. The first algorithm for the analysis of structured Markov chains was 
presented in (24). Fig. 5.1 shows algorithms Shffl , to compute y <— x • 0^L 1 A fe , and ShffV~ , to compute 
y 4 — x • I n fc-i 0 A k 0 I n K (from now on, the suffix w -b” denotes the version for the simpler case of a 
product where all matrices are the identity except one). 
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Fig. 5.2. Comparing Shffl with ordinary multiplication in the (A", a) plane. 


Shffl considers the matrices A fc sequentially, exploiting the equality [10]: 

K K 

< s ‘> ® a * = n s f» ■ ("». ® A ‘> - s w.<,). 

fc=l 1 

where S( aj &) e {0, l} a ' 6xo ' 6 is the matrix describing an (a, b) perfect shuffle permutation: 

J 1 if j = (i mod a) • b -f (i div a) 

^ a ' b ^ | 0 otherwise 

Therefore, a vector- matrix multiplication can be performed using K vector permutations and K mul- 
tiplications of the type x • (In fc ® A k ). Matrix I nk ® A k has a peculiar structure: it is simply matrix A k 
repeated njt times over the diagonal, hence, the cost of the k- th multiplication is 0(n^ ■ r;[A fc ]), while the 
permutation costs can be neglected, since they can be incorporated into the algorithm. 

The computation of the shuffle permutation is encoded in steps 10-13 and 15-18. The complexity of Shffl 
(derived in [4] as a generalization of the full storage case described in [26, Theorem 9.1]) can be rewritten 


O 



= 0(n-K -a). 


Hence, Shffl is faster than a multiplication using explicit storage iff 

nKa<na K a>K 1 ^ t 


Fig. 5.2 illustrates the regions where Shffl or ordinary multiplication perform better, according to the values 
of K (which is small in practical modeling applications) and a. 

Shfffl is the specialization of Shffl when A k ^ I is true for exactly one fc, and it is called with its 
parameters ni e j t and n r i g ^t set to n k ~ l and nj^j, respectively (Fig. 5.1). Its complexity is 

O (h k ■ 77[A fc ]) = O (n ■ a) . 

The resulting complexity of computing y <— y + x • ©£Li A* using Shfffl is then 
°(j2^-v[ Ak ^j = O == 0(n- K ■ a) . 
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Fig. 5.3. Vector-matrix multiplication by rows . 


5.2. A straightforward algorithm using sparse storage. A direct application of Eq. (2.1) results 
in the algorithm Rw in Fig. 5.3, which performs the computation y <— y + x • A and requires sparse row-wise 
format for the matrices A k . 

Procedure RwEl computes the contribution of a single entry x* to all the entries of y, as y y + 
ii * A i f. The K nested for statements in procedure RwEl are required to compute element A ifj according 
to its definition in Eq. (2.1). Some of the computation needed to obtain A it j is reused for other elements 
of matrix A. On a given call to RwEl , statement a* <— ak - i * A* fc|J - fc reached n*=i = ^{a k ) 

times. Rw calls RwEl n times, hence its complexity is 

/ ^ \ f /''I f TS\ A 1 nl+foorvoTon 


f 0(n h 
[ 0(n * a 


K) = 0(Kr 1 { A]) 

r y K ) = <5(r?[A]) 


( * \ f Ofn - K) = 0(K -T)[ A]) ultrasparse 

< 5 - 2 > ° (»•£“*)-( ^ 

In other words, the multiplications needed to compute a 2 through a^-\ are effectively amortized only if 
a> 1. 

The analogous algorithm Cl and procedure ClEl, for multiplication by columns, are omitted. Each call 
to ClEl computes a single entry y j of y as the inner product x • Aj-^, where j is equal to j[i t K] i n our 
mixed-base notation. Cl has the same complexity as Rw but requires sparse column- wise storage for the_ 
matrices A^. However, Rw and Cl differ in an important aspect: the multiplication performed by ClEl 
requires only one scalar accumulator, while RwEl uses the vector y itself as an accumulator. If we follow 
the good numerical practice of using higher precision for accumulators, Rw has larger memory requirements 
than Cl. 

The simplified multiplication algorithm Rw + > used to compute Kronecker sums, is also shown in Fig. 5.3. 
Its complexity is 

/ i?[A fc ]\__, . 


O ( n ■ 


= O (n • a) . 


The resulting complexity of computing y 


rik J 

y + x • ©£ =1 A k using Rw+ is then 
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Type of matrix 

e 

BED 

k = 2 

k — 3 

k = 4 

W fe (e) 

low 3 


— 

42 

— 

W k (e) 

low 2 


— 

21 

— 

W k (e) 

low 1 

1 . 

35 

— 

30 

W fe (e) 

joiriAB 


35 

— 

— 

W fc (e) 

fork AB 


35 


— 

W fc (e) 

joincD 

— 

— 

21 

15 

W fc (e) 

forkcD 

— 

— 

21 

15 

R fc 

local 

280 

140 

42 

30 


Table 5.1 


Number of nonzeros using the second decomposition. 


5.3, Considering only a subsets of states. The decomposition considered so far for our running 
example satisfies T = T. This is not the case in a second decomposition we now introduce, obtained by 
further refining the submodels H and L . For the high-priority parts, we define submodels 1, describing the 
processing of parts of type A and their joint processing with parts of type B, and 2, describing the processing 
of parts of type B on machine c \ . For the low-priority parts, we define analogous submodels 3 and 4. 

The synchronizing events are then 

£s = {I0W1J0W2J0W3, join ab, fork ab, join C D, forkco} 

where the “join” and “fork” events correspond to the start and end of assembling parts A with B, or C with 
D , respectively. 

For tia — tib = 4, and nc = no = 3, the cardinalities of the local state spaces are IT 1 ] = 126, \T 2 \ = 
70, |T 3 | = 56, and |T 4 | = 35. The potential state space is now much larger, |7j = 17,287,200, while |T| = 
1,560,888, as before, since we are modeling the same system. Table 5.1 gives the number of nonzeros for the 
matrices involved in the Kronecker description of R (missing entries indicate identity matrices, which do not 
need to be stored explicitly). 

The matrices for the Kronecker description of R now use a truly negligible amount of memory, 29,148 
bytes, but Shffl , Rw, and Cl need a large amount of space to allocate vectors of length T, even if we are 
really interested only in the elements corresponding to T. 

This observation motivates us to consider methods that compute y T = y T -f x r • A t,t> where T C T 
is the actual state space and y r and xj are stored using arrays y and x, of size |T|. Specifically, for i £ T, 
Xi is stored in position I = 'l'(i) of x: x* = x/. To restrict ourselves to reachable states only, we need to: 

• Generate T. Efficient algorithms for the generation of T can be found in [8, 21] 

• Ensure that only A r,r contributes to the value of y. If A is one of the matrices whose sum 
constitutes R, then A = 0 whenever i £ T and j £ T, that is, starting from a reachable state, 
only other reachable states can be reached. In matrix form, this implies that R^ r\r ~ 0 [9, 18]. 
The reverse is not true: if i £ T and j e T, A ij can be positive, that is, reachable states can be 
reached from unreachable states. 

• Find an efficient way to compute ¥ : T — ► {0 , . . . , \T\ — 1, null}. We use a logarithmic search in T, 
and show how the overhead is reduced by using an appropriate data structure to store T. 

Algorithm RwSb\ in Fig. 5.4 modifies Rw, by accessing only elements corresponding to reachable states. 
We omit the algorithm RwSb* to compute Kronecker sums, since an analogous discussion as for Rw + 


ll 



RwElSb\(\n: inout: y) 

i?u/S&i(in: x, A^ l,K \T\ inout: y) 

1. for each ji s.t. A* l01 > 0 

1. 

for each i[i,K] € T 

2. ai <— 

2. 

I - ♦(•u.*]); 

3. for each j 2 s.t. Af 2jj2 > 0 

3. 

RwElSb\{i[\,K] »x/, , y); 


4. Q>2 * Ql ‘ •^•*2,^2* 


5. for each Jk s.t. A^ K j K > 0 

6. a/c *— aK-\ A^ R jK \ 

7. j <- ^(j[i lK] y, 

8- y j <— yj + ^ Qjc; 

Fig. 5.4. Vector-matrix multiplication by rows for a subset T of the states. 


applies. Line 1 in RwSb i selects only elements of T among those in T. This requires no additional overhead 
as long as the elements of T can be accessed sequentially according to the order $?. The assignment in line 
7 of RwElSbi, however, requires finding the index J = of the element in the array y. If 

the computation of # uses a binary search, a multiplicative overhead factor 0(log \T\) is encountered in the 
innermost for- loop. The overall complexity of RwSb\ is then derived analogously to that of Rw. On a given 
call to RwElSbi, statement a k <- a fc _i • A k kJk is reached Uh=i * 7 ^ T „] = 0{a k ) times, and statement 
J <— ^(j[i,K]) is reached r/[A 1[1K] ,r] = 0(a K ) times. RwSBi calls RwElSbi once for each i[i,K] € T > 
hence its complexity is 


(5.3) 


K 


O m- £> 


* + a K 


log \T\ 


\k= 1 


O (|T| ■ ( K + log IT'D) ultrasparse 
O (|T| ■ a K • log |T|) sparse 


Since K < log |T| in practical modeling situations, we can conclude that RwSbi has a log |T| overhead 
with respect to ordinary multiplication, regardless of the matrix sparsity. 

In a multiplication by columns, the situation is potentially worse. CISbi, the version analogous to 
RwSbi, must avoid the “spurious” entries in In ClElSbi , the index I <— ^ computed by 

the binary search returns null if the “from” state i[\,K] * s n °t reachable. Hence, ClElSbi must test whether 
I = null and, if so, ignore entry A l[x K] . The average cost of these binary searches is slightly longer than 
for RwElSbi, since searching a state not in T represents a worst-case scenario, but, more importantly, the 
complexity of CISbi must account for all searches performed, regardless of whether they are successful or 
not. The number of such searches is equal to the number of nonzeros in the columns of A corresponding to 
T, r}[Af T ), while only t}[At,t] searches are performed in RwSbi • The sparser Af^ rr is, the closer the 
performance of CISbi is to that of RwSbi, and the two complexities coincide when A t\t,t ~ ® (this can 
happen even when T dT), 


5.4. Reducing the log |T| overhead. The multiplicative overhead log |T| in RwSbi and CISbi results 
from a worst-case assumption that we must search in a set of size |T| to compute each value of 

[19] discusses approaches to reduce this overhead, but the most effective solution appears to be obtained 
by storing T using the multilevel data structure shown in Fig. 5.5 [8]. 

Before explaining this data structure, we introduce the following sets: 

• 7* = T 1 x * * • x T k r= {0 , . . . , ui ~ 1 } x * * ■ x { 0 , . . . , Tik - 1}, the projection of the potential state 
space over the first k components. 

• Ti = {i[i k] : 3*[fc+i,JC]»i[i,jr] e T}, the projection of the actual state space over the first k compo- 
nents. 
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♦ x ¥ K (*"'i , 2 C¥ l (i i ),i 2 )—,i K ) = x V(i [l<K }) -► 


Fig. 5.5. Storage scheme for computation of $ in 0(log|T|). 

• ^ fc (i[i,fc-i]) = {*fc : *[i,fc] e T k } y the local states for Mk that can occur when the local states for M\ 
through Mk - 1 are i\ through i*_i, respectively. 

In particular, 7\ K = T, T X K = T, and we can define T 1 (i[i ) 0 ]) simply as T 1 . 

In Fig. 5.5, the elements of the array at level k contain local states for submodel A/^. When searching 
for a given state i[i t K}> we search first for ii in the array at level 1, containing T 1 . After finding ii, we 
follow its pointer to the array at level 2. The greyed-out portion of this array contains 7^(ii). We then 
search for i 2 in this portion, and so on, until we find the local state ix in the greyed-out portion of the last 
array, corresponding to T K (i[ \ y x-i])‘ The displacement of this local state in the entire array at level K is 
(i[i J K'])» If, at any level, we are unable to find we can conclude that the state we are searching is not in 
T, that is, ^( 1 ( 1 ,^]) = null. 

Since the arrays at levels 1 through K — 1 are usually small compared to the last level, and the array at 
level K % of size |T|, can be compressed into [log 2 nx] * \T\ bits, T can be stored in 0(\T\ ■ log 2 nx) bits [8], 

The real advantage of this data structure, however, is the amortization of the logarithmic searches. For 
a given i[i t x], we compute in K steps: 

Wlien searching for a second state ijj such that i[i t k] ~ i[i *.], we can reuse the work in the first k of these 
steps. In other words, if we saved the pointers identifying the greyed array for T k+l (i[ at level k + 1 
where we found ik+i, we can now start our search for i f k+1 in that portion, instead of starting all over at 
level 1. 

This results in algorithms RwSb2 and C7S&2- Fig- 5.6 shows C75&2, since it is used in the solution 
algorithm A-GSD of Sect. 6 . The tests for Ik ^ null for k < K are necessary because an event might be 
inactive due to a submodel M \ with h > A;, but there might not be a matching state for k{Jk-i, jk) at 
level k already. This is possible not only for ClSb2> but also for RwSb2 l . In addition, ClSb2 still is affected 
by nonzeros in A^ r , as discussed in Sect. 5.3, requiring the test lx ^ null in the innermost loop. The 
analogous test is instead not needed in RwSb2> 

1 We thank A. S. Miner for pointing out this possibility and providing an example where it occurs. 
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The complexity of ClElSb 2 is now dominated by the searches at each level, since for each multiplication 
at level fc, a 0(log n*) search is performed as well. On a given call to ClElSb 2 , statement 7* ®k(A-i, ifc) 

is reached 77 (<0^ A h ] k _ ) = 0(a k ) times. ClSb 2 calls ClElSb 2 for each j[\ t x] £ ^ hence its 

|> -It* 1 xT k J [ltk] J 


complexity is 


(5.4) 


O 


(\T\ V a * lo n ) = / ° (l T l ' £*=i log nfc ) = ° (Tl ' log n ) dtrasparae 
\ fc= i / 1 0{\T\ a K logn/c) sparse 


(the simplification for the sparse case is correct provided that tik is at least comparable to for any 
k < K). 

The complexity of RwSb 2 is analogous, except that the amount of computation performed in ClSb 2 is 
still worse than for RwSb 2 because searches for unreachable states are still possible. The problem is now 
reduced with respect to ClSb \ , though, because entries in A f\ TT are now discovered before reaching the 
innermost loop, if *[i,fc] ^ T k for some k < K. 

Comparing Eq. (5.4) with Eq. (5.3), we conclude that RwSb 2 and ClSb 2 have better complexity in the 
sparse case, since they can effectively amortize the logarithmic searches at each level only when the matrices 
A k have multiple elements per row. However, in the ultrasparse case, they have an overhead of log n = log \T\ 
instead of just log |T|. This is due to the pessimistic assumption that a logarithmic search at level k requires 
O(lognfc) comparisons. If, for each k , all sets T fc (i[ 1)fc _ 1 j) were of approximately the same size, then their 
complexity in the ultrasparse case would be reduced to 0(|T| * log|T|), but this might not be the case in 
practice. One factor not evidenced by the complexity expressions, though, is that, unlike RwSb 1 and ClSb \ , 
RwSb 2 and ClSb 2 avoid reaching the innermost for-loop whenever a partial state is not reachable, so they 
might actually perform better than RwSbi and ClSb\ even in the ultrasparse case. 

RwSb J and CISb J are the simplified algorithms for matrices arising from a Kronecker sum. Fig. 5.6 
shows CZS&2 ■ O n a given call, ClElSb 2 f reaches statements h <— ^k(h-iJk) and Ih <— ®fc(/h-i, jh) for 
k < h < K y with an overall cost of O (j2h=k at m ost T}[A r k tjk ] = 0(a) times. CISb % calls ClElSb J 

once for each j[i y K] € T, hence its complexity is 


O ^|T|-a -^lognfc^ . 

The resulting complexity of computing y *— y + x •(©£.. A*) using ClSb~2 is then 
O (j:\T\ a • £logn^ = O ^|T| ■ a ■ • logn fc ^ = 0{\T\ ■ a ■ K ■ logn) . 


h=k 


Thus, the logarithmic search overhead decreases from submodel Mi, where no amortization occurs, to M/c, 
where only an overhead log Hk is encountered, but the overall overhead remains logn, since the number of 
nonzeros in 0^L 1 A k is approximately |Tj • a * K. 


5.5. Interleaving rows and columns. RwSb 2 and ClSb 2 fail to amortize the logarithmic searches 
in the case of ultrasparse matrices because the K nested for-loops in RwElSb 2 and ClElSb 2 consider only 
the entries on a given row or column of A, respectively. If A is ultrasparse, only one entry is found, and no 
amortization occurs. 

To further improve the complexity, we need to consider the elements of A in a different order, interleaving 
the K nested for-loops in RwElSb 2 , or ClElSb 2 , with those implicitly defined by the single for-loop in RwSb 2 , 
or ClSb 2 . 
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ClElSb 2 (\n\ J[i,k]>x, A^ 1,K T inout: y) 

1. 

for each n s.t. A^ ^ > 0 

2. 

I\ <— 

3. 

if /1 ^ null then 

4. 


5. 

for each i 2 s.t. A 2 3>J2 > 0 

6. 

1 2 «— ^ 2 ( 71 , 12 ); 

7. 

if 72 / null then 

8. 

a 2 <— oi -A? 2i j 2 ; 

9. 

for each i K s.t. A*. iK > 0 

10. 

Ik «— ^k(7k-i, ik); 

11. 

if Ik / null then 

12. 

ax <— ok-i * A^ k j k ; 

13. 

y y + x/ K ok; 

C7£7>2(in : x,A^’^,T; inout: y) 

1. 

for each j[i,K] 6 T 

2. 

J «- ®0'[i,ki); 

3. 

ClElSb2<J[i,K],x, Al'' K ly j); 


ClElSb^ (in: .?[i s k] » x, fc, A fc ; inout: y) 

1 . 

7 fc _i <— . . ^2(^i(ii),i2) • . . 

2. 

for each i k s.t. A* fc0fc > 0 

3 . 

Ik i ( 7 ^ — 1 , ifc) ; 

4. 

if I k 7^ null then 

5. 

Jfc+i ^k+i(Iky ifc+i); 

6. 

if 7 fc + i 7^ null then 

7. 

7 k <— ^k( 7 k-i, Jk)', 

8. 

if Ik 7^ null then 

9. 

V <- V + xik Aj tii(k ; 

ClSb^i in: x, fc,A fc ,T; inout: y) 

1 . 

for each j^k] € T 

2. 


3 . 

ClElSb% , x, fc, A fc , yj); 


7 ?tt; 563 ( 

n: x, A inout: y) 

1 . 

for each i\ 6 T 1 

2 . 

/1 «- 'Mil); 

3. 

for each j\ s.t. A * 1j>71 > 0 

4. 

Ji «- Mil): 

5. 

if J\ 7 ^ null then 

6 . 

ai .ji ' 

7. 

for each 1*2 6 T 2 (n) 

8 . 

I 2 ♦— ^2(71,22); 

9. 

for each s.t. A 2 2 ^ > 0 

10 . 

J2 <— ^(Ji , .72); 

11 . 

if J 2 7 ^ null then 

12 . 

^2 * ai ■ A^ 2 ) j 2 , 

13. 

for each i K 6 (*[i f K-i)) 

14. 

7 k <— 'J'k^k-i, ik); 

15. 

for each j K s.t. A f c K j K > 0 

16. 

Jk <— #k(Jk-i,,7k); 

17. 

if J K zfz null then 

18. 

o>k <— clk- 1 ■ A^ k j k ; 

19. 

yj K *- y j K +x /jf • ok; 


RwSb^(\n: x, k ) A k t T\ inout: y) 

1. for each i[i,fc_i] 6 T^ -1 

2. Ik-i <— . . ^ 2(^1 (ii), 12 ) ■ • • ,ifc-i); 

3. for each i k £ ^’ fc (*f ll jb_ 1 ]) 

4. I k <— 

5. for each j k s.t. A, fc fcJfc > 0 

6. Jk <— i &k(Ik-i,jk)', 

7. for each i k +i E T fc+1 (i [1>Jt] ) 

8. I k +i <— tyk+iih, U+i); 

9. Jt+i <— U+i); 

for each i K e T K (i[ ltK ^n) 

Ik <— 

Jk <— ^k(Jk-i,«k); 


10. 

11. 

12. 

13. 


Fig. 5.6. A better vector-matrix multiplication for a subset of the states . 


We employ this idea to derive algorithm RwSbs , shown in Fig. 5.6 and used in the solution algorithm A - 
JCB of Sect. 6. The statements I k <— >u) do not require a search, since all states i k e j) 

are enumerated sequentially in the for statement. A search is performed only to obtain J k <— '& k (J k _i,j k ) 
and, even if this is a multiplication by rows, the tests J k ^ null are necessary, as already discussed for RwSB 2 
and ClSb 2 - 


Statement J k 


^k{Jk-ujk) 


is performed 



xT k 


0(\T*\ * a fc ) times, hence 
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(a) mult 

(a) CPU 

(a) wall 

(b) mult 

(b) CPU 

(b) wall 

1 

22,638,318 

54.9 

178.2 

— 

— 

— 

Rw 

18,832,668 

38.6 

43.4 

— 

— 

— 

Cl 

18,832,668 

54.1 

54.2 

243,941,600 

1,488.2 

3,325.2 

RwSbi> RwSbi 

18,832,668 

156.5 

157.5 

23,199,652 

219.8 

220.2 

CISb 1 , CISbi 

18,832,668 

191.4 

191.9 

23,199,652 

259.8 

260.5 

RwSb2 f RwSb 2 

17,987,499 

56.3 

57.2 

20,980,675 

60.3 

61.8 

CISh , ClSb 2 

17,987,499 

91.1 

91.6 

20,980,675 

106.7 

106.9 

RwSbz, RwSbli 

15,714,589 

43.0 

45.2 

15,960,012 

24.4 

24.6 


Table 5.2 

Computational effort for vector-matrix multiplication. 


the complexity of RwSb 3 is: 

(5.5) O \T k \ ■ a k • logn*^ = 0(\T\ ■ a K • log n K ) 

(assuming that | T l K ~ l \ |T|). Thus, finally, we achieve a smaller log uk overhead with respect to ordinary 
multiplication, regardless of the type of sparsity. 

RwSb J in Fig. 5.6 is the simplified vector-matrix multiplication algorithm for matrices arising from a 
Kronecker sum. Also in this case the complexity is dominated by the innermost for-loop, where the 0 (log tik) 
search to compute Jk *— is performed r/[Ar,r] = 0(|T|*a) times. The complexity of RwSb J 

is then 


0(\T\ ■ a • log™*) 

regardless of k> and the resulting complexity of computing y *— y + x 

0(K * |T| • a - logn/c) , 

only a log rtx overhead with respect to ordinary multiplication. 

The complexity of CISbs and CISb 3 is the same as that of RwSb$ and RwSb% , although spurious 
entries are still a disadvantage. Unfortunately, though, CISb 3, unlike ClSb\ and ClSb 2 , does not compute 
the entries of y in order. This property prevents us from interleaving rows and column indices to reduce the 
logarithmic overhead if we want to use a Gauss-Seidel type iteration in our solution algorithm. 

5 . 6 . Comparing the multiplication algorithms. Fig. 5.7 compares the theoretical complexities of 
Shfft and Rw or Col, and of ordinary multiplication, assuming T — T and K = 4. While it is clear that Shfft 
is much superior for large values of a, the matrices W*(e) in typical modeling applications are ultrasparse 
or even hypersparse. In the region around a = 1, Shfft and Rw have the same performance, K times worse 
than ordinary multiplication. Indeed, Rw outperforms Shffl when a < 1, since it may recognize that an 
entire row of A is zero before reaching the innermost for-loop. 

On the other hand, Shffff and Rw + have exactly the same complexity as ordinary multiplication. In 
other words, when computing x * A where A is the Kronecker product of K matrices, all of them equal to an 
identity matrix except one, exploiting the Kronecker structure of A does not result in additional overhead 
(since the generic entry A*’ j of A is obtained without performing any multiplication) , nor in better efficiency 


* ( 0 fc=i r us ^ n S RwSbz is 
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Fig. 5.7. Comparing the complexity of Shffl, Rw, and ordinary multiplication (K = 4). 


(since no partial result can be reused the way Shffl does) . The same holds for the complexity of computing 
y y + x < A fc , except that ordinary multiplication is faster if there are many merged diagonal 

entries in ©j^ A k . In our application, the diagonals of the matrices R fc are zero, so this does not happen. 

These observations are confirmed by our running example. Table 5.2 gives the number of floating 
point multiplications performed by the algorithms we introduced and their execution times to compute 
x ■ R, or xx • Rr,Ti where R is given by Eq. (4.1). Columns labeled (a) consider the decomposition into 
two components, where T = T and R consists of three Kronecker products and one Kronecker sum, while 
columns labeled (b) refer to the second decomposition into four components, where \T\ |T| and R consists 

of seven Kronecker products and one Kronecker sum. The Kronecker sum in (a) contains more local events 
and more nonzero entries than the one in (b). We list both CPU and elapsed (wall) time in seconds, for a 
Sun SPARCstation 4 under SunOS 4.1.4, with a 85 MHz CPU, 64 MB main memory, and virtual memory 
limited to 398 MB. 

In the first decomposition, the matrices W*(e) are ultrasparse or hypersparse and, as predicted by 
our theoretical observations for a < 1, Rw outperforms Shffl, but all Kronecker-based algorithms are less 
computationally efficient than a conventional multiplication where R is stored in sparse format, which would 
only require 7][ R] = 13,439,073 multiplications. 

This suggests that, in practice, the real advantage of Kronecker-based methods lies exclusively in their 
large memory savings. 

In the second decomposition, r}[A^ r ] = 7?[At,t], hence one should expect no significant difference 
between row and column algorithms. However, as discussed in the following section, we use column algorithms 
only to allow a Gauss- Seidel type of iteration, where entries of the new iterate of the steady-state probability 
vector must be computed sequentially. Hence, in a multiplication by columns, we consider one state at a 
time, and all the events that lead to it, while, in a multiplication by row, we are free to consider one event 
at a time, and all the states it can affect. The latter way avoids having to switch between events, hence the 
row algorithms perform better. 

However, the column variants use less memory because the operand vector is directly overwritten by 
the results. This is the reason why, for the second decomposition, Cl can still execute while Shffl and Rw 
fail due to excessive memory requirements. However, Cl heavily relies on virtual memory, as the difference 
between CPU and elapsed times indicates. Cl considers 139,172,250 matrix entries in R, although only 
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7 /[R] = 13,439,073 are relevant. 

Shffl, Rw, and Cl are obviously inadequate when |T| » |T|; only algorithms based on T run acceptably 
fast for the second decomposition. The results indicate that their overhead is effectively reduced from RwSb\ 
to RwSb 2 to RwSb 3 , and from ClSb x to ClSb 2 ■ Clearly, there is no reason ever to use RwSbi or CISbi; we 
introduced them only as a stepping stone to the better algorithms. 

In s umm ary, RwSb 3 is a fast and robust algorithm, almost as fast as Rw even when [T| = |T|; it 
uses only 0(|7"|) memory, and makes full use of the multilevel data structure for the storage of T. For 
multiplication by columns, instead, Cl is considerably faster than ClSb 2 when |T| = |T|, but ClSb 2 is far 
superior when |T| » |T|, and it uses the least amount of memory in all cases. 

It should also be noted that RwSb 3 is faster with the second decomposition than with the first one, even 
if it performs slightly more operations. This is due to its log tik overhead: in the first decomposition, tik 
is much larger than in the second one (2394 vs. 126); indeed, 43.0/24.4 = 1.76 « log 2394/ log 126 = 1.61. 
Clearly, the model decomposition can greatly affect the overall solution time; how to reach an optimal choice 
is a topic for future research. 

We conclude this section by observing that we described our algorithms using K nested loops for il- 
lustration purposes only. Since K is model-dependent, a recursive implementation is required. To improve 
performance, we implemented this recursion iteratively with dynamically-allocated arrays of size K [20] . 


6 . Model solution algorithms. We now return to the problem of solving a Markov model, that is, 
Eq. (1.1). In practical modeling problems, Q is very large and indirect iterative numerical methods such as 
Power, Jacobi, or Gauss-Seidel are normally employed for the solution. In all cases, starting from a guess tt ' ' , 
which does not have to be the initial probability vector if the CTMC is ergodic, successive approximations 
7 r (m) are computed, until convergence is reached. 

In terms of the actual state space, the iterations we consider are: 

• Power method: 7 r (m+1) <- 7 r (m) ■ (I + Q h*), where h* is a value slightly larger than the maximum 
expected sojourn time in any state, maxo</<|T|{h/}. Element-wise, the Power method corresponds 
to: 

(6.1) VJe {0,1,... |T|- 1), <- 7r^ m) + [ */ m) ■ Q/,/ ) ' h *- 

Vo</<|T| J 


The Power method is guaranteed to converge in theory, but it is often extremely slow. 

• Jacobi method: 7 r( m+1 ) 4 — 7 r^ m ^ * R * diag( h). Element-wise, the Jacobi method corresponds to: 

(6.2) VJe {0,1,... |T| - 1}, 7T^ m+1) <- [ ?4 m) R/,J ) hj. 

\0</<|T|,/#J / 

The Jacobi method does not have guaranteed convergence, but it is usually faster than the Power 
method in practice. The Jacobi and Power methods coincide when all the sojourn times have the 
same value and, in the Power method, h* is set to this value instead of a value slightly larger. 

• Gauss-Seidel method: 7r( m+1 5 «- 7r (m) L-(dia 5 (h) _1 -U) _1 for forward Gauss-Seidel, or 7r( m+1) <- 

7 j-( m ) . u ■ — L)" 1 for backward Gauss-Seidel, where L and U are strictly lower and upper 

triangular matrices satisfying L + U = R. Element-wise, (forward) Gauss-Seidel corresponds to: 


(6.3) VJ e {0,1,... |T| — 1}, 7T 


(m+l) 


7 r 


(m+l) 


10 <I<J 


■ R i,j + 

•7<J<|T| 


7 r 


(m) 


R 




h j. 
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Gauss-Seidel does not have guaranteed convergence either, but it is guaranteed to be faster than 
the Jacobi method (if they converge), so it is considered the best among these three methods. Its 
convergence rate, however, is affected by the order in which the states are considered. 

Relaxation can be used to accelerate convergence [27]. We do not investigate this possibility, since it 
does not affect our discussion. Other iterative solution techniques, such as projection methods, have also 
been applied successfully for the analysis of Kronecker- based models. However, these techniques have higher 
memory requirements and, in any case, they too perform vector- matrix multiplications, so our discussion 
could be extended to them. For a detailed analysis of numerical techniques for the solution of Markov chains, 
see [26]. 

6.1. Alternative solution approaches. The first choice in a Kronecker-based solution is whether to 
use data structures of size |T| or \T\. Initial efforts have adopted the former approach [13, 14, 25], using 
a probability vector tt € 1R T \ initialized so that only states in T have nonzero probability (e.g., the initial 
state has probability one). This is required because, even if we assume that the CTMC is ergodic, that is, 
T is a single recurrent class, T might instead contain multiple recurrent classes. By ensuring that all the 
probability mass is in the class corresponding to T at the beginning of the iterations, we guarantee that this 
is true upon convergence as well. Entries fr* = 0 correspond to unreachable states and have no effect on the 
solution. 

Previous approaches, however, employ only the Power or Jacobi methods because they restrict themselves 
to accessing the matrix R by rows. As pointed out in Sect. 5, they compute the entries of a new iterate 
^(m+i) i ncremen tally, using the values of the previous iterate 7r^ m \ so that double-precision vectors should 
be used. 

The use of Gauss-Seidel requires instead computing tT|7 + V-i} before This can be accomplished 

if we have access to R by columns, that is, if we can efficiently obtain all the nonzero entries Rij, for a 
given j £ T and any i £ T. We have shown how to do this in Sect. 5. An additional advantage is that 
single-precision vectors can be used in this case. 

We now examine the timing requirements of the various solution algorithms, according to whether they: 

• Use the perfect shuffle approach, Stf, or our multiplication procedures. 

• Store vectors the size of the potential, P, or actual, A, state space. 

• Perform a Jacobi ( JCB ) or Gauss-Seidel (GSD) iteration. 

We indicate the resulting algorithms as SH-JCB [4, 15], P-JCB , A-JCB , P-GSD, and A-GSD. In the original 
SAN paper [24] introducing the Kronecker-based solution approach, the Power method is used instead of 
Jacobi. Thus, we present first the Jacobi method using function Shffl to realize the iteration. Fig. 6.1 and 
6.2 list only the statements within the main iteration of the numerical method, that is, the computation of 
a new iterate given the current one. 

We also compare the space used by the various algorithms, ignoring the memory needed to store the 
matrices R* and W*(e), which are necessary in all the algorithms we consider, and are in any case negligible 
compared to the storage for the required vectors. For simplicity, we assume that rate(e) for a synchronizing 
event e is equal one, and ignore it from now on; if its value were not one, we could simply incorporate it into 
exactly one of the matrices W fe (e), for some k (in a practical implementation, it is best to choose a k for 
which W*(e) £ I). 

An alternative to avoid storing R explicitly is simply to generate it “on-the-fly” at every iteration, 
directly from the high-model specification. While a Jacobi-style iteration is most natural, Deavours and 
Sanders [12] have shown how to use a variant of Gauss-Seidel in conjunction with an on-the-fly approach 
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Fig. 6.1. Algorithm SH-JCB. 


for a set of modeling formalisms including GSPNs, stochastic activity networks, and stochastic reward nets. 
A similar idea is also in [22], However, the time complexity of this approach is at least as high as that of 
the algorithms we present, and events requiring no time (e.g., immediate transitions in GSPNs [1, 2]) cause 
additional overhead, since entire paths, not single events, must be explored in this case to generate a single 
entry, so we do not consider it further. 

6.2. Algorithm SH-JCB. The algorithm SH-JCB in Fig. 6.1 implements the Jacobi method using 
Shffl and ShfflA for the vector- matrix multiplications. The time complexity of one SH-JCB iteration is 
independent of the submodel ordering: 

(6.4) O ( • ( t?[R fc ] + Y, ^ k (e)} 

\k = 1 \ e:W fe (e)^I 

Five vectors of length n are needed: one for the expected holding times, h, one each for the previous and 
the new iteration vectors, iz old and Tr new , plus two auxiliary vectors used when calling procedure Shffl, 7r aui1 
and 7 r atix2 , Additionally, two vector z and z' are needed in the procedures Shffl and ShffP , but they are only 
of size maxfc(n/t), much smaller than n . Of these, jr new } ?r aux2 , an( j z ' should be stored in double-precision, 
because they are used as accumulators. 

6.3. Algorithm P-JCB . P-JCB is the simplest iteration, it uses the Rw and Rw+ vector-matrix 
multiplications presented in Fig. 5.3 and, just as algorithm SH-JCB , it uses vectors of length n. Its complexity 
depends on the order of the components: 

O (f> • r,[K k ] + E E IP [W h (e)] 

\k=l e€£s k—l h=l 

P-JCB requires three vectors vectors of size n: n old , jt new , and h; only n new is used to accumulate 
sums. 




6.4. Algorithm A-JCB. A-JCB has the same convergence behavior of SH-JCB and P-JCB , but uses 
data structures of size \T\ by employing the RwSbz and RwSb^ vector-matrix multiplications presented in 
Fig. 5.6. The complexity of one A-JCB iteration is 

V f0 RfeN ) +Y r > (® wfc ( e )) J - iog n«r 
\fc=l / T,r / T,T_ / 
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Fig. 6.2. Algorithms P-JCB t A-JCB, P-GSD, and A-GSD. 


If the number of merged entries in the above expression is negligible, this simplifies to 

Ofa[R] *logn*:)> 

that is, just a lo gn K factor over the complexity of a traditional Jacobi iteration where R is stored explicitly. 
The memory requirements of A-JCB are the same as for P-JCB , except that vectors are now of size |T|, not 
n. 

6.5. Algorithm P-GSD . With the Gauss-Seidel method, the old and the new iterate can be stored into 
a single vector. If R were described by a single Kronecker product (S>k=i A k , P-GSD would be achieved by 
the simple call Cl(it, it), followed by the same elementwise multiplication of 7r by the expected 
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Procedure 

iteration holding time auxiliary search data 
vectors vector vectors structure 

SH-JCB 

P-JCB 

A-JCB 

P-GSD 

A-GSD 

n- (S+D) n- S n ■ (S+D) — 

n- (S+D) n- S — — 

\T\- (S+D) |T|- S — « \T\‘ L 

n- S n- S — — 

|T|- S |T|- S — in L 


Table 6.1 


Memory requirements for model solution algorithms. 


holding times, as performed by P-JCB. However, R consists of the sum of several Kronecker products, which 
can be processed sequentially in a Jacobi iteration, but not in a Gauss-Seidel iteration, since we must now 
complete the computation of the new 7 fj before starting that of ff t+ i. Hence, P-GSD must call the functions 
ClEl and ClEl + directly, not through Cl or Cl ‘ . 

The complexity of P-GSD is then the same as that of P-JCB. This makes it a better choice, since Gauss- 
Seidel has better convergence than Jacobi, and only one vector, ff, is required in addition to the expected 
holding times h. Furthermore, we can store tt in single-precision. 


6.6. Algorithm A-GSD. The comments made for P-GSD apply to A-GSD as well. As observed at 
the end of Sect. 5, the interleaving or rows and columns of ClSb 3 and ClSb^ cannot be used, so ClSb 2 
and ClSb% must be used instead, whose amortization of the logarithmic search is less effective. This points 
out a surprising tradeoff between A-JCB , which has slower convergence but a smaller overhead, log n x , and 
A-GSD, which has better numerical convergence but higher overhead, possibly as high as logn. 

The complexity of A-GSD is 


K 


O 


m-£ 


V T i X ’ 1 J h=k eeS s [\ft= 1 / Tf~ l xT k ,T*2 


■ log Tlfc 


1 


|7i fc l 


6.7. Comparing the model solution algorithms. Table 6.1 summarizes the memory requirements 
for the solution algorithms we considered, expressed in the units S and D, for a single- or double-precision 
floating point number (usually 4 and 8 bytes, respectively), and L, for a local state of Mk (usually 1 or 2 
bytes). The actual memory usage for our running example is instead in Table 6.2, for decompositions (a) 
and (b). Col umn “vectors” lists the memory (in bytes) for the iteration vectors and h; column “extra” for 
auxiliary vectors or search data structures. 

The timing results are in Table 6.3. We performed iterations using the absolute convergence criterion 
\\ir old _ 7r" eu, ||oo < 10 -8 , and a relaxation factor of 0.95. 

As already anticipated in Table 5.2, algorithms SH-JCB and P-JCB fail due to insufficient memory with 
the second decomposition, while P-GSD could be run, but with an unacceptable amount of overhead (we 
estimate it would require about six days of CPU time). 

We observe that the two decompositions result in different state orderings, which in turn affect the 
convergence of A-GSD. Hence, 200 iterations are required for the first decomposition, but only 150 for the 
second one (convergence is tested every 10 iterations). 
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Procedure 

(a) vectors (a) extra 

(b) vectors (b) extra 

SH-JCB 

P-JCB 

A-JCB 

P-GSD 

A-GSD 

24,974,208 18,730,656 

24,974,208 — 

24,974,208 3,127,000 

12,487,104 — 

12,487,104 3,127,000 

276,595,200 207,446,400 

276,595,200 

24,974,208 3,804,700 

138,297,600 -- 

12,487,104 3,804,700 


Table 6.2 

Memory requirements (in bytes) for our example. 


Procedure 

(a) CPU 

(a) wall 

(a) iter 

(b) CPU 

(b) wall 

(b) iter 

SH-JCB 

20,327 

65,940 

370 

— 

— 

— 

P-JCB 

14,292 

16,067 

370 

— 

— 

— 

A-JCB 

15,928 

16,725 

370 

10,022 

10,054 

370 

P-GSD 

10,822 

10,841 

200 

— 

— 

— 

A-GSD 

18,236 

18,337 

200 

20,464 

21,218 

150 


Table 6.3 

Execution times ( seconds ) and number of iterations for our example. 


7. Conclusion. We have presented a comprehensive set of Kronecker-based matrix- vector multipli- 
cation and solution algorithms for structured Markov models in a unified framework, which ignores the 
peculiarities of specific modeling formalisms. Time and space complexities are given, with special attention 
to the sparsity of involved matrices. 

We have shown how the Kronecker-based solution of structured Markov models can be carried on with 
smaller memory and execution complexity than previously proposed. This is achieved by exploiting the 
sparsity of the matrices involved in the Kronecker operations, by considering the actual state space instead 
of the potential state space (which can contain many unreachable states), by adopting a sophisticated data 
structure to determine whether a state is reachable or not, and by performing vector-matrix multiplications 
by rows or by columns, thus allowing the use of both Jacobi- and Gauss-Seidel-style methods. 

Our results are not limited to steady-state solution of ergodic models. Indeed, the computation of the 
cumulative sojourn time in the transient states up to absorption in an absorbing CTMC also requires the 
solution of (nonhomogeneous) linear system, while the iteration performed by the Uniformization method 
for the transient solution of a CTMC is essentially the same as that of the Power method. 

The proposed algorithms have been implemented in SupGSPN [20] for the Petri net formalism; imple- 
mentation of a more general software package fully supporting the state-dependent behavior we described 
is under way [7]. The reduced memory requirements allows us to solve very large Markov models (over 10 7 
states) on a modem workstation in a matter of hours. 

REFERENCES 

[1] M. Ajmone Marsan, G. Balbo, and G. Conte, A class of Generalized Stochastic Petri Nets for the 

performance evaluation of multiprocessor systems, ACM Trans. Comp. Syst., 2 (1984), pp. 93-122. 

[2] M. Ajmone Marsan, G. Balbo, G. Conte, S. Donatelli, and G. Franceschinis, Modelling 

with generalized stochastic Petri nets, John Wiley & Sons, 1995. 


23 






[3] P. Buchholz, Numerical solution methods based on structured descriptions of Markovian models , in 

Computer performance evaluation, G. Balbo and G. Serazzi, eds., Elsevier Science Publishers B.V. 
(North-Holland), 1991, pp. 251-267. 

[4] A class of hierarchical queueing networks and their analysis, Queueing Systems., 15 (1994), 

pp. 59-80. 

[5] G. CHIOLA, On the structural and behavioral characterization of P/T nets, in Proc. 5th Int. Workshop 

on Petri Nets and Performance Models (PNPM’93), Toulouse, France, Oct. 1993, IEEE Comp. Soc. 
Press. 

[6] G. Ciardo, A. Blakemore, P. F. J. Chimento, J. K. Muppala, and K. S. Trivedi, Automated 

generation and analysis of Markov reward models using Stochastic Reward Nets, in Linear Algebra, 
Markov C hains , and Queueing Models, C. Meyer and R. J. Plcmmons, eds., vol. 48 of IMA Volumes 
in Mathematics and its Applications, Springer- Verlag, 1993, pp. 145-191. 

[7] G. ClARDO AND A. S. Miner, SMART: Simulation and Markovian Analyzer for Reliability and Tim- 

ing, in Proc. IEEE International Computer Performance and Dependability Symposium (IPDS’96), 
Urbana-Champaign, IL, USA, Sept. 1996, IEEE Comp. Soc. Press, p. 60. 

[8] , Storage alternatives for large structured state spaces, in Proc. 9th Int. Conf. on Modelling Tech- 

niques and Tools for Computer Performance Evaluation, R. Marie, B. Plateau, M. Calzarossa, and 
G. Rubino, eds., LNCS 1245, Saint Malo, France, June 1997, Springer- Verlag, pp. 44 -57. 

[9] G. ClARDO AND M. Tilgner, On the use of Kronecker operators for the solution of generalized stochas- 

tic Petri nets, ICASE Report 96-35, Institute for Computer Applications in Science and Engineering, 
Hampton, VA, May 1996. 

[10] M. Davio, Kronecker products and shuffle algebra, IEEE Trans. Comp., C-30 (1981), pp. 116 125. 

[11] D. D. Deavours AND W. H. Sanders, An efficient disk-based tool for solving very large Markov 

models, in Proc. 9th Int. Conf. on Modelling Techniques and Tools for Computer Performance 
Evaluation, R. Marie, B. Plateau, M. Calzarossa, and G. Rubino, eds., LNCS 1245, Saint Malo, 
France, June 1997, Springer- Verlag, pp. 58 -71. 

[12] , “On-the-fly” solution techniques for stochastic Petri nets and extensions, in Proc. 7th Int. Work- 

shop on Petri Nets and Performance Models (PNPM’97), Saint Malo, France, June 1997, IEEE 
Comp. Soc. Press, pp. 132 141. 

[13] S. DONATELLI, Superposed Stochastic Automata: a class of stochastic Petri nets with parallel solution 

and distributed state space, Perf. Eval., 18 (1993), pp. 21-26. 

[14] , Superposed generalized stochastic Petri nets: definition and efficient solution, in Application 

and Theory of Petri Nets 1994, Lecture Notes in Computer Science 815 (Proc. 15th Int. Conf. on 
Applications and Theory of Petri Nets), R. Valette, ed., Zaragoza, Spain, June 1994, Springer- Verlag, 
pp. 258-277. 

[15] P. Fernandes, B. Plateau, and W. J. Stewart, Efficient descriptor-vector multiplication in 

stochastic automata networks, Rapport Apache (LGI, LMC) 12, 1994. 

[16] U. Herzog and M. RetteLBACH, eds., Proc. of the 2nd Workshop on Process Algebra and Perfor- 

mance Modelling (PAPM), Arbeitsberichte des IMMD 27 (4), 1994, IMMD, Universitat Erlangen. 

[17] R. A. Howard, Dynamic Probabilistic Systems, Volume II: Semi-Markov and Decision Processes, John 

Wiley and Sons, 1971. 

[18] P. KEMPER, Numerical analysis of superposed GSPNs, IEEE Trans. Softw. Eng., 22 (1996), pp. 615-628. 

[ 19 ] 5 Superposition of generalized stochastic Petri nets and its impact on performance analysis , PhD 


24 



REPORT DOCUMENTATION PAGE 

Form Approved 
OMB No. 0704-0188 

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, 
gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this 
collection of information, including suggestions For reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson 
Davis Highway, Suite 1204, ArlingtonTvA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503 

1. AGENCY USE ONLY(Leave blank) 

2. REPORT DATE 

December 1997 

3. REPORT TYPE AND DATES COVERED 

Contractor Report 

4. TITLE AND SUBTITLE 

Complexity of Kronecker Operations on Sparse Matrices with Appli- 
cations to the Solution of Markov Models 

5. FUNDING NUMBERS 

C NAS 1-97046 
C NAS1-19480 
WU 505-90-52-01 

6. AUTHOR(S) 

Peter Buchholz, Gianfranco Ciardo, Susanna Donatelli, and Peter Kemper 

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 

Institute for Computer Applications in Science and Engineering 
Mail Stop 403, NASA Langley Research Center 
Hampton, VA 23681-0001 

8. PERFORMING ORGANIZATION 
REPORT NUMBER 

ICASE Report No. 97-66 

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 

National Aeronautics and Space Administration 
Langley Research Center 
Hampton, VA 23681-2199 

10. SPONSORING/MONITORING 
AGENCY REPORT NUMBER 

NAS A/CR-97-206274 
ICASE Report No. 97-66 

11. SUPPLEMENTARY NOTES 

Langley Technical Monitor: Dennis M. Bushnell 
Final Report 

To be submitted to INFORMS Journal on Computing 

12a. DISTRIBUTION/AVAILABILITY STATEMENT 

Unclassified -Unlimited 
Subject Category 60, 61 
Distribution: Nonstandard 
Availability: NASA-CASI (301)621-0390 

12b. DISTRIBUTION CODE 

13. ABSTRACT (Maximum 200 words) 

We present a systematic discussion of algorithms to multiply a vector by a matrix expressed as the Kronecker product 
of sparse matrices, extending previous work in a unified notational framework. Then, we use our results to define new 
algorithms for the solution of large structured Markov models. In addition to a comprehensive overview of existing 
approaches, we give new results with respect to: (I) managing certain types of state-dependent behavior without 
incurring extra cost; (2) supporting both Jacobi-style and Gauss-Seidel-style methods by appropriate multiplication 
algorithms; (3) speeding up algorithms that consider probability vectors of size equal to the “actual” state space 
instead of the “potential” state space. 

14. SUBJECT TERMS 

Kronecker algebra, Markov chains, vector-matrix multiplication 

15. NUMBER OF PAGES 

29 

16. PRICE CODE 

A03 

17. SECURITY CLASSIFICATION 
OF REPORT 

Unclassified 

18. SECURITY CLASSIFICATION 
OF THIS PAGE 

Unclassified 

19. SECURITY CLASSIFICATION 
OF ABSTRACT 

20. LIMITATION 
OF ABSTRACT 


klSN 7540-01-280-550^ Standard Form 298( Rev. 2-89) 

Prescribed by ANSI Std. 239-18 


298-102 
























thesis, Universitat Dortmund, 1996. 

[20] , SupGSPN Version 1.0 - an analysis engine for superposed GSPNs , tech, rep., Universitat Dort- 

mund, 1997. 

[21] , Reachability analysis based on structured representations , in Application and Theory of Petri 

Nets 1996, Lecture Notes in Computer Science 1091 (Proc. 17th Int. Conf. on Applications and 
Theory of Petri Nets, Osaka, Japan), J. Billington and W. Reisig, eds., Springer- Verlag, June 1999, 
pp. 269-288. 

[22] B. LUBACHEVSKY and D. Mitra, A chaotic asynchronous algorithm for computing the fixed point of 

nonnegative matrices with unit spectral radius , J. ACM, 33 (1986), pp. 130-150. 

[23] S. PlSSANETZKY, Sparse Matrix Technology , Academic Press, 1984. 

[24] B. PLATEAU, On the stochastic structure of parallelism and synchronisation models for distributed 

algorithms , in Proc. 1985 ACM SIGMETRICS Conf. on Measurement and Modeling of Computer 
Systems, Austin, TX, USA, May 1985, pp. 147-153. 

[25] B. Plateau AND K. Atif, Stochastic Automata Network for modeling parallel systems , IEEE Trans. 

Softw. Eng., 17 (1991), pp. 1093 1108. 

[26] W. J. Stewart, Introduction to the Numerical Solution of Markov Chains , Princeton University Press, 

1994. 

[27] W. J. Stewart and A. Goyal, Matrix methods in large dependability models , Tech. Rep. RC-11485, 

IBM T.J. Watson Res. Center, Yorktown Heights, NY, Nov. 1985. 


25 



